ABSTRACT


Message hierarchies in web discussion boards grow with new 
postings. Threads of messages evolve as new postings focus 
within or diverge from the original themes of the threads. 
Thus, just by investigating the subject headings or contents 
of earlier postings in a message thread, one may not be able 
to guess the contents of the later postings. The resulting 
navigation problem is further compounded for blind users 
who need the help of a screen reader program that can pro- 
vide only a linear representation of the content. We see 
that, in order to overcome the navigation obstacle for blind 
as well as sighted users, it is essential to develop techniques 
that help identify how the content of a discussion board 
grows through generalizations and specializations of topics. 
This knowledge can be used in segmenting the content into 
coherent units and guiding the users through segments rel- 
evant to their navigational goals. Our experimental results 
showed that the segmentation algorithm described in this 
paper achieves a success rate of up to 80-85% in labeling messages. 
The algorithm is being deployed in a software system 
to reduce the navigational load of blind students in access- 
ing web-based electronic course materials; however, we note 
that the techniques are equally applicable for developing web 
indexing and summarization tools for users with sight. 

Categories and Subject Descriptors

H.5.4 [Information Interfaces and Presentation]: 
Hypertext/Hypermedia - Navigation, user issues; H.5.1 
[Information Interfaces and Presentation]: Multimedia 
Information Systems - Hypertext navigation and maps; H.3.3 
[Information Storage and Retrieval]: Content Analysis 
and Indexing - Abstracting methods, indexing methods; 
K.4.2 [Computers and Society]: Social Issues - Assistive 
technologies for persons with disabilities 

buzz proj.                Vander, Ryan        Tue May 25, 2004 9:21 am
Re: buzz proj.            True, Thomas        Thu May 27, 2004 7:53 pm
Re: buzz proj.            Vander, Ryan        Sat May 29, 2004 2:08 pm
Re: buzz proj.            Grain, Robert       Sun May 30, 2004 6:10 pm
Re: buzz proj.            Vander, Ryan        Sun May 30, 2004 10:23 pm
Assignment 4              Rodriguez, Luisa    Thu May 27, 2004 3:04 pm
Report for Assig. 4       True, Thomas        Thu May 27, 2004 7:57 pm
Re: Report for Assig. 4   Candan, Kasim       Mon May 31, 2004 12:07 am
Assignment #4             Atilla, John        Fri May 28, 2004 10:41 pm
Re: Assignment #4         Candan, Kasim       Mon May 31, 2004 12:19 am
Questions on #4           Roosewelt, Daniel   Sat May 29, 2004 11:00 pm
Re: Questions on #4       Candan, Kasim       Mon May 31, 2004 12:23 am
Re: Questions on #4       Ray, Luisa          Mon May 31, 2004 10:34 pm
Re: Questions on #4       Home, Chris         Tue Jun 1, 2004 12:23 am
Report Length             True, Thomas        Tue Jun 1, 2004 11:39 am
Re: Report Length         Candan, Kasim       Wed Jun 2, 2004 1:39 am
Assignment # 4            Bird, Sarah         Tue Jun 1, 2004 9:14 pm


Figure 1: A hierarchy of messages posted to a course 
discussion board: although the subject headers of 
the messages can give some idea about what the 
postings are about, they provide little information 
to help differentiate the actual contents of different 
messages 


1. INTRODUCTION 


Complex web sites continue to proliferate, as web-based 
information infrastructures become integral parts of educa- 
tional, corporate, and e-commerce organizations. Yet, due 
to the continuously increasing sizes and complexities of these 
infrastructures, it is also becoming more and more difficult 
for users to understand and navigate through such sites. 
The navigation problem is especially critical for users with- 
out sight. With the passage of the 508 web accessibility 
mandate, many companies and federal government agencies 
are required to follow accessibility guidelines when design- 
ing web sites. Such guidelines are very effective when design- 
ing mostly static and non-individualized information outlets. 
However, when 

• the material being delivered is information rich yet 
  arbitrarily structured, 

• the content is dynamically generated through multiple 
  users’ inputs, interactions, and annotations, or 

• the users have to follow non-linear, individualized 
  pathways through the material, 


the navigational challenge is compounded, even for users 
with sight. Unfortunately, these characteristics are very 
common in online course servers and discussion boards. 


[Figure 2 image: two course page screenshots, one showing the "General Discussion" forum (107 messages) of a course discussion board]


Figure 2: Two sample views from course pages containing announcements, course documents (e.g. lecture 
notes), course information (e.g. syllabus), assignments, external links, group pages, and discussion boards. 
In this paper, we focus on providing access to discussion boards, such as those included in these samples 


1.1 Motivation 


Like many others, ASU’s educational web site¹ hosts course 
home pages, containing lecture notes, a syllabus, assign- 
ments, project material, course related documents, announce- 
ments, external links (links to materials residing in different 
hosts or different locations in the course server), grades, cal- 
endars, group pages, and discussion boards (Figure 2). Some 
of this content is fixed, meaning that it does not change 
during a semester (e.g. course syllabi), but the majority of the 
content evolves (e.g. discussion boards) through contribu- 
tions by the instructors, teaching assistants, and students. 
Our students without sight emphasized that, although the 
screen reader software enables them to access the electronic 
material, they still have to struggle when accessing richly- 
structured, heterogeneous, and constantly growing content, 
such as discussion boards (Figure 1). With the aim of re- 
ducing the navigational load of blind students, we are de- 
veloping a software interface, called iCare-Assistant, that 
provides context- and task-dependent navigational guidance 
when accessing on-line educational materials that are al- 
ready available for the use of sighted students. State-of-the- 
art browser-based interfaces [33] and existing navigational 
aids, such as site maps and visual cues [26], alleviate this 
load for only sighted users and are generally not applicable 
to dynamically growing content. Instead, we employ trans- 
parent guidance and dynamic adaptation techniques [22, 23] 
in iCare-Assistant to help students without sight. Such dy- 
namic adaptation and guidance requires an understanding 
of the inherent, but implicit, structures of the content avail- 
able at the educational web sites. In this paper, we focus 
on the challenge of identifying coherent information units 
(or segments) in dynamically growing hierarchical content 
in discussion boards. 


1.2 Problem Statement: Topic Segmentation 
of Dynamically Growing Hierarchical Web 
Content (Discussion Boards) 

Unless a hierarchy corresponds to a well-defined concep- 
tual structure, it does not present information effectively: if 
the structure is not self-revealing, higher level nodes in the 


¹ myasucourses.asu.edu, implemented using the Blackboard 
software [1] 




[Figure 3 content: a chain of three messages]

Posting by a student: "What material do we need to read 
to solve the question 5 in assignment 4?" 

Reply by teaching assistant: "Chapter 5." 

Reply by instructor: "BTW, please focus on the subsection 
on join algorithms" 

Annotations in the figure: "question 5", "join", "Unique context" 


Figure 3: A chain of three messages: The messages 
are too short and incomplete for indexing: they 
obtain their context from their relevant ancestors. 
Once it is identified that these three messages are 
within the same context, keywords can be inherited 
between these messages for proper indexing. 


hierarchy cannot direct users to the information available 
at the lower levels. This is the case for message hierarchies 
in discussion boards which grow freely through postings of 
different users at different times: for instance, a posting 
containing a question may lead to new postings that are not 
necessarily directly related to the original question. Thus, 
just by looking at the subject headings or contents of the 
first few postings in a message hierarchy, one may not be 
able to guess the actual contents of the replies deeper in 
the same thread (Figure 1). This complicates the task of 
navigating within message hierarchies in discussion boards. 

When storing personal information (such as already-read email messages) for 
reuse, as in Microsoft’s Stuff I’ve Seen [24], contextual cues, 
such as time and author, can be used to search for and 
present information. However, in a discussion board, where 
the content is freely growing through multiple users’ inputs, 
interactions, and annotations, such contextual cues may not 
be enough (Figure 1). In order to provide proper naviga- 
tional support to users, a guidance system must identify, as 
precisely as possible, the next possible step(s), based on the 
current navigational context. When the context changes, 
the system should adapt to this change by identifying the 


[Figure 4 image: a message hierarchy whose branches are labeled (new topic), (generalization), and (specialization)]


Figure 4: An example showing three types of topic 
divergences in a message hierarchy: the original dis- 
cussion theme of “web segmentation” leads to a new 
discussion topic (“web privacy”), a more general 
discussion on the topic of the “segmentation problems”, 
and a more specific thread on the “topic segmentation” 
issues 


most suitable content that has to be brought closer to the 
user in the navigational space [22, 23]. Such dynamic adap- 
tation of the information space requires an indexing system 
which can leverage the logical relationships between vari- 
ous contents, such as messages that refer to the same as- 
signment within the same context. Most messages, on the 
other hand, are too short to be meaningful by themselves, 
and therefore, they obtain their context from their parents 
and ancestors (Figure 3). However, as a discussion hierarchy 
grows through posting of new messages, its content and con- 
text will also evolve and possibly diverge from the original 
posting (Figure 4). Although not all postings will cause a 
divergence from the initial theme, some of the postings will 


• focus on a specific aspect of the original message, 

• take the discussion to a more general platform, or 

• diverge significantly from the original theme, introducing 
  an entirely new discussion theme. 


In a loosely structured environment, where the structure it- 
self is not known, topic distillation [4, 6, 29] and web site 
summarization [13] algorithms are useful in understanding 
the underlying structure. In linearly authored (such as text) 
documents, linear text segmentation techniques [19, 20] can 
be useful in identifying coherently authored components. 
However, in freely (and arbitrarily) evolving message hier- 
archies in discussion boards, the challenge is not to identify 
how a document is authored, but to discover how the dis- 
cussion topics have evolved and how they can be segmented 
to identify context (topic) boundaries to facilitate indexing, 
retrieval, ranking, and presentation of appropriate informa- 
tion units (or segments) to the user. 

Thus, the segmentation problem within this context 
can be defined as searching for special nodes — which are 
the entry points to new, general, or specific topics — within 
a single hierarchy of dynamically evolving web content (Fig- 
ure 4). Once the segmentation is completed, each segment 
can be independently indexed, keywords can be inherited 
(bottom-up or top-down) based on the generalization and 
specialization behaviors, and users can be directed to the 

Figure 5: Topic segmentation of a discussion hierarchy 


entry point of the most relevant segment to their current 
context. 


1.3 Contributions of this Paper 


In our previous work, we explored web indexing and min- 
ing of web information units [31, 32], mining document as- 
sociations [11, 12, 13], structural mining of hierarchical con- 
tent [10], and summarization of web sites [13] for sighted 
people. In this paper, we build on our existing work by de- 
veloping segmentation (Figure 5) techniques for discovering 
the topic evolution structures of dynamic and hierarchical 
web-content, such as discussion boards, for effective index- 
ing and presentation. We develop algorithms for identifying 
how the topic content of a discussion board evolves through 
generalizations and specializations as well as introduction of 
new topics. This knowledge is used in identifying coherent 
segments of the discussion content. With a precise under- 
standing of the structure of the available discussion content, 
it could then be possible to fully utilize the context, access 
history, and user preferences in locating the appropriate discussion 
segment and presenting it to the user.² As described 
above, these algorithms are being developed to be used in 
the iCare-Assistant software for blind students in accessing 
web-based electronic course materials. However, we note 
that the techniques are equally applicable for developing web 
summarization tools for users with sight. 


2. RELATED WORK 


In this section, we present the related work in the domains 
of topic segmentation, topic distillation, topic tracking, adaptive 
hypermedia, video segmentation, adaptive and assistive web 
technologies, and web community mining. 
Topic Segmentation: The idea of topic segmentation has 
been applied to full-text documents in order to obtain small 
and coherent documents which can be used as visualiza- 
tion aids [15, 35]. Since the focus of this research has been 
the segmentation of text documents, underlying techniques 
have been borrowed from the text segmentation [19, 20] lit- 
erature. The main difference between the text segmentation 
and discussion board segmentation is that, while text docu- 
ments usually present a coherent (authored) linear structure 
that can be exploited for segmentation, discussion boards 
evolve through (mostly short) postings by many contribu- 
tors. Thus, linear text segmentation [19, 20] techniques are 
not directly applicable in this domain. 


² The segment indexing and presentation techniques are 
outside the scope of this paper. 


Topic Distillation: Hypermedia has two aspects: con- 
tent and structural information. Web structures can be 
used as clues while indexing and presenting content. Var- 
ious techniques have been proposed to use the web struc- 
ture in identifying document associations, such as the com- 
panion and co-citation algorithms proposed by Dean and 
Henzinger [21]. One approach to organizing web query re- 
sults based on available web structure is topic distillation 
proposed in [29]. This technique organizes topic spaces as 
a smaller set of hub and authoritative pages, and thus, it 
provides an effective means for summarizing query results. 
[4] improved the basic topic distillation algorithm presented 
in [17] through additional heuristics. [6] further considers 
page fanout in propagating scores. Topic distillation has 
been used by many search engines, including Google, IBM 
Clever [17], and TOPIC [9]. Note that topic distillation [4, 
6, 29] could be a natural choice for summarization purposes. 
However, these techniques are usually general purpose and 
ignore the special hierarchical and dynamic structure of the 
web content, such as discussion boards. The techniques we 
develop in this paper, on the other hand, exploit these two 
inherent features to establish the underlying segmentation 
framework. 

Topic Tracking: Like the topic distillation work described 
above, topic detection and tracking (TDT) research [3, 5, 37, 
40], which mainly focuses on detecting and tracking events 
in streaming news data, is related to the work presented 
in this paper. TDT systems monitor continuously updated 
news stories and try to detect the first occurrence of a new 
story; i.e., an event significantly different from those news 
events seen before. To detect the first story, current TDT 
systems compare a new document with the past documents 
and make a decision regarding the novelty of the story based 
on the content-based similarity values. For example, the 
method proposed in [5] is based on an incremental TF-IDF 
model, and it involves segmentation of documents to locate 
all stories on a previously unseen (new) event in a stream 
of news stories. In contrast, the naturally evolving nature 
of discussion threads and the need for fine-granularity seg- 
ment boundary identification make the problem of topic seg- 
mentation significantly harder than the new-event detection 
problem addressed by the TDT technologies. 

Adaptive Hypermedia: Adaptive hypermedia is a rich re- 
search field that dates back to the early 1990s [9]. Adaptive 
hypermedia uses two different but complementary methods, 
namely adaptive presentation and adaptive navigation [9]. 
Adaptive presentation is manipulation of content fragments 
in a hypertext document. Order of fragments can be changed, 
or fragments can be made invisible or less visible within a 
page. Stretchtexts, where text fragments can be stretched 
or shrunk on the basis of user interests, are also used. Adap- 
tive navigation, on the other hand, is the manipulation of 
links. Direct guidance, link sorting, link hiding, link anno- 
tation, link generation, and map adaptation are the tech- 
niques used. Detailed discussion of all these approaches, 
both for adaptive presentation and adaptive navigation, can 
be found in [7, 8, 9, 14, 16]. Researchers in the AI com- 
munity have developed web navigation tour guides, such as 
WebWatcher [28]. WebWatcher utilizes user access patterns 
in a particular web site to recommend users proper naviga- 
tion paths for a given topic. User access patterns can also be 
incorporated into the algorithms we present in this paper. 
[30] presents a technique for constructing multi-granularity 

Figure 6: Special case: segmenting a single root-to-leaf 
chain 


and topic-focused site maps. Their technique can help in 
visualizing the topology of the web site; thus, it supports 
navigation. Nonetheless, most of these approaches exploit 
visual cues as they are designed to help sighted individuals. 
Video Segmentation: Video segmentation literature [27, 
34, 36, 37] is also relevant to the work presented in this 
paper. In video, shot (or segment) boundaries are usu- 
ally detected by comparing various features of consecutive 
frames or neighborhoods of frames to identify major content 
changes [27]. Thus, spatio-temporal continuity of common 
features (e.g. objects, color histograms) shared between two 
consecutive frames increases the likelihood that these two 
frames are part of a single coherent segment. As we discuss 
in the next section, in messaging systems, the varying (but 
short) sizes of the messages and arbitrarily used (intended, 
forgotten, or implied) quotations from the ancestor messages 
further complicate the detection of segment boundaries. 
Adaptive and Assistive Technologies: Technologies re- 
lied upon by the users with visual impairments include screen 
readers, screen magnifiers, voice recognition software, hy- 
permedia to hypertext transformers, and refreshable Braille 
displays. State-of-the-art browser-based interfaces [33] and 
navigational aids [26] mostly rely on visual guidance, which 
is not useful for users who are blind. In this paper, we do 
not focus on specific adaptive technologies exploited to make 
educational sites accessible [22, 23]. Instead, we present 
the underlying enabling technology of topic segmentation 
for discussion boards. 

Web Community Mining: Web communities, such as 
discussion boards and Usenet, are places where people freely 
participate in discussions. Even though web communities 
contain a lot of human knowledge, many search engines 
which have been successful for general purpose web data 
do not apply well because they ignore inherent structures of 
the web communities and furthermore postings are usually 
short. [38] creates a specialized ranking function for Usenet 
by using linear regression and support vector machine tech- 
niques. Their approach is based on metadata, such as prior 
knowledge about the message author or the depth of the 
message. They do not address the short-message problem. [39] 
suggests a method to extract information from web discus- 
sion boards and email archives by summarizing threads. To 
extract a thread summary, they use quote and comment re- 
lationships, which indicate there are topic bindings between 
messages. 


[Figure 7 flowchart: Input: individual nodes on a chain of messages. Step 1 (low-granularity segmentation): new topic? Step 2 (fine-granularity segmentation): same, specific, or general? Output labels: new, same, spec., general]



Figure 7: Two step segmentation of a message 


3. SEGMENTATION OF MESSAGE 
HIERARCHIES 


Once a hierarchy of messages is segmented as in Figure 5, 
each segment can be treated as an atomic entity (for in- 
stance, if keyword vectors are used for indexing, such vec- 
tors can be extracted for the collection of messages in a given 
segment), or a key message (for instance, the first message in 
the segment) can be chosen to represent the segment. Sim- 
ilar techniques are used in shot (or segment) identification 
and indexing [18, 25, 27, 34, 36] in linear video streams. 
Therefore, before tackling the problem of segmentation of a 
hierarchy of messages, we first focus on the special case of 
segmentation of a single root-to-leaf chain (Figure 6) in the 
hierarchy. In Section 3.2, we extend this to the general 
case of segmenting entire hierarchies. 


3.1 Segmenting a Single Message Chain 


The approach of segmentation of a sequence of documents 
was effectively utilized in detecting news stories about a pre- 
viously unseen event in a stream of news stories [5]. The seg- 
mentation technique used (comparing each new document 
with all, or a carefully selected few, of the previously seen 
documents to identify in which cluster they belong) is good 
when the goal is to identify if a document is content-wise 
similar to a previously seen group of documents. However, 
when the required segmentation is of finer quality, as in try- 
ing to identify whether the topic of the current message is 
more specific or more general than the topic of its ancestors, 
such a comparison is not sufficient. Therefore, in this paper, 
we propose a two-step approach to segmentation (Figure 7): 
we process the message nodes in a chain in a top-down man- 
ner; for each node, 


• first, we perform a low-granularity segmentation to 
  identify whether the message is of an unrelated topic 
  (relative to the postings immediately before it in the 
  same thread) or not; 

• in the second step, if the message is identified to be 
  similar to the previous messages, a higher-granularity 
  segmentation process, which tries to determine whether 
  the message is more specific or more general than the 
  previous messages, is carried out. 


In this section, we discuss these two steps in detail. 
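
The two-step procedure outlined above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the function name `label_chain`, the label strings, and the two caller-supplied predicates are hypothetical, labeling the root message as "new" is an assumption, and the set-based predicates in the usage example are toy stand-ins for the similarity tests of Sections 3.1.1 and 3.1.2.

```python
def label_chain(chain, is_new_topic, refine):
    """Label each message on a root-to-leaf chain, processing top-down.

    chain        -- list of messages, root first (any representation)
    is_new_topic -- hypothetical Step I predicate: (previous, message) -> bool
    refine       -- hypothetical Step II classifier: (previous, message)
                    -> "same", "specific", or "general"
    """
    labels = []
    for i, msg in enumerate(chain):
        if i == 0:
            # Assumption: the root message opens the first segment.
            labels.append("new")
            continue
        prev = chain[i - 1]  # the posting immediately before this one
        if is_new_topic(prev, msg):
            labels.append("new")               # Step I: unrelated topic
        else:
            labels.append(refine(prev, msg))   # Step II: finer granularity
    return labels


# Toy stand-ins (assumptions): messages are keyword sets.
def toy_is_new_topic(prev, msg):
    return not (prev & msg)          # no shared keyword at all


def toy_refine(prev, msg):
    if msg < prev:
        return "specific"            # strict subset of the previous keywords
    if msg > prev:
        return "general"             # strict superset of the previous keywords
    return "same"


chain = [
    {"assignment", "join", "index"},
    {"join"},
    {"join", "assignment", "index", "grading"},
    {"parking"},
]
labels = label_chain(chain, toy_is_new_topic, toy_refine)
print(labels)
```

Passing the two tests in as functions keeps the control flow of the two steps separate from the particular similarity measures, which are discussed next.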
Message 1: Quotation (left at the bottom of the message) 
does not provide context 

Thanks a lot...BTW, is it possible that 
two different nodes have the same ad-labels? I found in the 
data file produced by JK’s code, there exists two different 
nodes with the same ad-labels and different ses-labels! 

Quote> At this stage there is no planner; we simply pick a 
Quote> sequence of MI joins, such that at each stage results 
Quote> from one operator join with results from another one. 


Message 2: Quotation (this time intentionally kept at the 
top, before the reply text) is included to provide context to 
the reply 

Quote> two different nodes have the same ad-labels? I found in the 
Quote> data file produced by JK’s code, there exists two different 
Quote> nodes with the same ad-labels and different ses-labels! 

..this is curious... can you please give us more details 
regarding this case? 


Figure 8: The quotations in Message 1 do not pro- 
vide a common context, whereas the quotations in 
Message 2 do provide a common context 


3.1.1 Step I: Identifying New Topic Boundaries 


In this step, we identify whether the current message is 
sufficiently different from the previous postings in the same 
thread to be marked as a new topic. Unlike in a stream 
of news documents [3, 5, 37], where different news may be 
interleaved in a given sequence, in a given thread of a dis- 
cussion board, there is a natural tendency of maintaining 
the same topic because most postings are replies to previ- 
ous ones. Thus, unlike the previous work on TDT, a new 
node does not need to (and cannot) be compared to all its 
ancestors, but has to be compared to its immediate parent 
(or an immediate sequence of ancestors) as it is (they are) 
causally closest to the current node. 

A similar approach of comparing the features of a frame 
locally with its immediate predecessors works well in identi- 
fying shot boundaries in video streams [27]. When compar- 
ing two consecutive video frames, any of the common fea- 
tures (e.g. objects, color histograms) shared between them 
increases the likelihood that these two frames are part of a 
single shot (a coherent segment). However, when segment- 
ing discussion threads, there are certain complications: 


• First, unlike consecutive video frames that are mostly 
  identical, consecutive messages of the same topic may 
  be of different length, style, and content. 

• Secondly, in many messaging systems, original postings 
  are automatically included in replies as quotations; 
  hence, unless quotations are used in a way to 
  strengthen the link between the original message and 
  the reply, they may not highlight a common context 
  (Figure 8). 


Thus, keywords in quotations should be treated differently 
based on the relevance of the quotations as determined by 
their placement in the message; in general, quotations se- 
lectively used within the body of a message (Message 2 in 
Figure 8) are more relevant than the quotations left (poten- 
tially forgotten) as a bulk at the end of a message (Message 1 
in Figure 8). In this paper, we do not focus on the problem 


of identifying selectively-used quotations; instead, we focus 
on the impact of quotations on the segmentation task. 

In general, keywords in quotations can be considered as 
keywords inherited from the ancestors. By including them 
in the keyword vector of a message, we can implicitly in- 
crease the similarity between the current message and the 
quoted message. However, keywords in quotations have to 
be treated differently than the other keywords to prevent 
undeserved bias. Let us represent each message in the hierarchy 
as a keyword weight vector $(w_1, w_2, \ldots, w_n)$. The 
weight, $w_i$, of the keyword $k_i$ is computed using the aggregate 
frequency of the keyword, 

$$freq_i = freq_{i,0} + \sum_{1 \le d \le quot\_depth} imp(d) \times freq_{i,d}$$


where 


• $freq_{i,0}$ is the frequency of the keyword $k_i$ in the 
  message excluding the quotations, 

• $freq_{i,d}$ is its frequency in d-level quotations (quotations 
  from the parent message, as in Figure 8, are of 
  1-level), and 

• $imp(d)$ is the impact factor of the quotations that are 
  of depth d. 


Thus, the contribution of the quotation keywords to the 
overall frequency of the term varies based on the value of the 
corresponding impact factor. Impact factors greater than 1 
imply that the resulting keyword vector will have a higher 
similarity to the ancestor from which the quotations have 
been taken, whereas impact factors less than 1 imply that, 
although quotations are important, the actual content of the 
message should be used for determining whether the mes- 
sage is similar to the ancestors or not. Note that, even when 
the impact factor is only 1, the existence of the quotation 
keywords in the message gives bias towards increased simi- 
larity between the ancestors and the message. 
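
The aggregate frequency computation can be illustrated with a short sketch. The function name `aggregate_frequencies`, the input layout (a token list for the message body and a dictionary from quotation depth to token lists), and the geometrically decaying impact factor in the usage example are all assumptions made for illustration; the paper does not prescribe a particular form for imp(d).

```python
from collections import defaultdict


def aggregate_frequencies(own_tokens, quoted_tokens_by_depth, imp):
    """Compute freq_i = freq_{i,0} + sum over d of imp(d) * freq_{i,d}.

    own_tokens             -- keywords in the message, excluding quotations
    quoted_tokens_by_depth -- dict: quotation depth d (1 = parent) -> keywords
    imp                    -- impact factor as a function of depth d
    """
    freq = defaultdict(float)
    for kw in own_tokens:
        freq[kw] += 1.0                 # contributes to freq_{i,0}
    for depth, tokens in quoted_tokens_by_depth.items():
        for kw in tokens:
            freq[kw] += imp(depth)      # contributes imp(d) * freq_{i,d}
    return dict(freq)


# Hypothetical message: three own keywords, quotations of depth 1 and 2.
own = ["join", "algorithm", "chapter"]
quoted = {1: ["join", "assignment"], 2: ["assignment"]}

# Assumed impact factor: halves with every additional quotation level.
weights = aggregate_frequencies(own, quoted, imp=lambda d: 0.5 ** d)
print(weights)
```

With impact factors below 1, as here, a quoted keyword such as "assignment" still enters the vector but counts for less than a keyword the author actually wrote.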

Once a keyword weight vector is computed for a message, 
the cosine similarity between this vector and the keyword 
vector of the parent message (or the keyword vector repre- 
senting the segment being computed so far) can be used to 
classify the input message as having a new topic or being of 
the same topic as that of the parent. Other similarity and 
distance measures, such as Hellinger distance and Kullback-Leibler 
divergence, have also been shown to work well in the TDT 
domain [3, 5, 37]. 
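
As an illustration of this classification step, the sketch below computes the cosine similarity between two sparse keyword weight vectors and applies a cut-off. The threshold value of 0.2 and the function names are assumptions chosen for the example only; a suitable threshold would have to be determined empirically.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors given as keyword -> weight dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector shares nothing with any other vector
    return dot / (norm_u * norm_v)


def is_new_topic(message_vec, parent_vec, threshold=0.2):
    """Step I decision; the threshold is a hypothetical, tunable parameter."""
    return cosine_similarity(message_vec, parent_vec) < threshold


parent = {"join": 1.0, "assignment": 1.0}
reply = {"join": 1.5, "algorithm": 1.0}
off_topic = {"parking": 1.0}

print(is_new_topic(reply, parent))      # shares "join" with the parent
print(is_new_topic(off_topic, parent))  # shares no keyword with the parent
```

The same function applies unchanged when the second argument is the keyword vector of the segment computed so far rather than that of the parent message.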


3.1.2 Step II: Segmentation based on Specialization/ 
Generalization 


If a message on a given chain is identified to introduce a 
new topic to the discussion, this message can be used as a 
segment boundary. On the other hand, if the difference be- 
tween the message being considered and the earlier messages 
is not large enough to trigger segmentation, then an initial 
segmentation is not possible. However, even though a mes- 
sage may not diverge significantly from the initial theme, it 
may 

• focus on a specific aspect of the common theme or 

• take the discussion to a more general platform. 


Finding such specialization and generalization boundaries 
is also important because understanding when a discussion 
topic diverges helps with indexing (by choosing the 

Figure 9: Visual representation of the contents of 
two (parent/child) messages of the same topic: the 
two messages share a common base (a common con- 
text), but they also have their own content 


right keyword weights for the given segment) as well as guid- 
ing the user (without sight) to the most appropriate entry 
point within a discussion. Therefore, in this step, among 
the parent/child messages that are identified to be of the 
same topic, we need to detect specialization and generaliza- 
tion boundaries. For this purpose, we first need to define 
the terms specialization and generalization. 

In general, as shown in Figure 9, given two messages, A 
and B, of the same topic, they will have a common base, 
while both messages will also have their own content, dif- 
ferent from their common base. If the common base, C, 
of these two messages can be identified, then the degree of 
specialization can be defined as 


$$spec(A, B) = 1 - similarity(A, C_B),$$


where $C_B$ is the content in message B corresponding to the 
common base with A. Intuitively, given two messages, A and 
B, that are already identified as being of the same topic, if 
the original message, A, is not similar to the common base 
of the two messages, it means that the common base is a 
small part of the original message; i.e., the new message 
specializes within the original message. 

Similarly, given the common base between A and B, the 
degree of generalization can be defined as 


$$gen(A, B) = 1 - similarity(B, C_A),$$


where $C_A$ is the content in message A corresponding to the 
common base with B. Again, intuitively, given A and B of 
the same topic, if the new message, B, is not similar to the 
common base of the two messages, this would mean that the 
common base is a small part of the new message; i.e., the 
new message generalizes on the original message. 
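
A small worked example may help fix the two definitions. The sketch below assumes that cosine similarity is used as the similarity function and that the common-base contents of the two messages are already known; both are assumptions, since identifying the common base is the hard part in practice. The function names and the keyword vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two keyword -> weight dicts (assumed similarity)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def spec_degree(a_vec, common_base_in_b):
    """spec(A, B) = 1 - similarity(A, C_B)."""
    return 1.0 - cosine(a_vec, common_base_in_b)


def gen_degree(b_vec, common_base_in_a):
    """gen(A, B) = 1 - similarity(B, C_A)."""
    return 1.0 - cosine(b_vec, common_base_in_a)


# Original message A covers four subjects; the reply B is only about joins.
A = {"join": 1.0, "index": 1.0, "sort": 1.0, "hash": 1.0}
B = {"join": 2.0}
C_A = {"join": 1.0}  # content of A that belongs to the common base
C_B = {"join": 2.0}  # content of B that belongs to the common base

s = spec_degree(A, C_B)
g = gen_degree(B, C_A)
print(s, g)
```

Here the common base is a small part of A but all of B, so the specialization degree exceeds the generalization degree: the reply narrows the discussion rather than broadening it.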

Unfortunately, in practice, identifying the common base 
of two messages and computing the specialization and gen- 
eralization degrees by comparing the two messages to this 
common base are not trivial tasks. We use the quotations 
from previous messages to help us with this process. Thus, 
we fragment each message on a discussion board into zero 
or more anchored parts and a free part. An anchored part of 
a message is composed of the quotation messages from the 
parent and ancestors as well as the parts of the message iden- 
tified to be replies to these quotations. For instance, Mes- 
sage 2 in Figure 8 is composed of a quotation-reply pair; in a 
sense, the quotation message is a context-providing pointer 
to the ancestor, which can be used to improve the accu- 
racy of segmentation. The free part of a message is the 
part which is not immediately associated with the parent or 
ancestor quotations. 
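
The fragmentation into anchored and free parts could be sketched as follows. The quoting convention is not described in this section, so this sketch assumes that quoted lines start with `>`, that the non-quoted text following a quotation is the reply to it, and that any text before the first quotation forms the free part. The function name `fragment` is hypothetical.

```python
def fragment(text):
    """Split a message into anchored (quotation, reply) pairs and a free part.

    Assumptions (not taken from the paper): quoted lines begin with '>',
    a reply is the non-quoted text that follows a quotation, and text
    before the first quotation is the free part.
    """
    anchored = []  # list of [quote_lines, reply_lines]
    free = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith(">"):
            quote = stripped.lstrip(">").strip()
            # A quoted line after a reply opens a new quotation-reply pair.
            if not anchored or anchored[-1][1]:
                anchored.append([[quote], []])
            else:
                anchored[-1][0].append(quote)
        elif anchored:
            anchored[-1][1].append(stripped)
        else:
            free.append(stripped)
    pairs = [(" ".join(q), " ".join(r)) for q, r in anchored]
    return pairs, " ".join(free)
```

A message with no quotations yields zero anchored parts and a free part containing the whole text, which corresponds to the "zero or more anchored parts and a free part" case described above.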

Table 1: Weights used to measure undifferentiated,
low-only, and differentiated errors

     |     Undiff.      |     Low-only     |        Diff.
     | N | S | SF | SG  | N | S | SF | SG  | N | S   | SF  | SG
-----|---|---|----|-----|---|---|----|-----|---|-----|-----|-----
N    | 0 | 1 | 1  | 1   | 0 | 1 | 1  | 1   | 0 | 1   | 1   | 1
S    | 1 | 0 | 1  | 1   | 1 | 0 | 0  | 0   | 1 | 0   | 0.5 | 0.5
SF   | 1 | 1 | 0  | 1   | 1 | 0 | 0  | 0   | 1 | 0.5 | 0   | 0.5
SG   | 1 | 1 | 1  | 0   | 1 | 0 | 0  | 0   | 1 | 0.5 | 0.5 | 0


For the anchored parts of a message, if quotations from
the parent or ancestors are used as context-providing pointers,
then the degree of specialization or generalization should
be defined within the associated context. Taking this into
account, we define the degree of generalization as


$$\frac{1}{N} \times \Big( n_{free} \times gen(D_{par}, d_{free}) + \sum_{d_i \in anchored} n_{anch,i} \times gen(q_i, d_i) \Big),$$


where

• $N$ is the number of keywords in the message,

• $d_{free}$ is the free part of the message and $n_{free}$ is the
number of keywords in this part,

• $(q_i, d_i)$ is the $i$-th anchored quotation-reply pair and $n_{anch,i}$
is the number of keywords in this pair, and

• $D_{par}$ is the parent message.
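
The keyword-weighted combination above can be sketched in Python as below. The per-part scores are passed in as plain numbers so that the same function serves for either degree. Taking $N$ to be the sum of the free and anchored keyword counts is an assumption made for this sketch, and the name `weighted_degree` is hypothetical.

```python
def weighted_degree(free_score, n_free, anchored):
    """Keyword-weighted average of per-part generalization (or specialization) scores.

    free_score: score of the free part against the parent, e.g. gen(D_par, d_free)
    n_free:     number of keywords in the free part
    anchored:   list of (n_anch_i, score_i), one per quotation-reply pair,
                where score_i is e.g. gen(q_i, d_i)
    """
    # Assumption: N is the total keyword count over all parts of the message.
    n_total = n_free + sum(n for n, _ in anchored)
    if n_total == 0:
        return 0.0
    weighted_sum = n_free * free_score + sum(n * s for n, s in anchored)
    return weighted_sum / n_total
```

For example, a message with a 10-keyword free part scoring 0.8 and a single 30-keyword anchored pair scoring 0.2 is dominated by the anchored pair, and its combined degree is pulled towards 0.2.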


Note that, while the free part of the message is compared 
directly against the parent (assuming that the parent, which 
is of the same topic, provides the context), the anchored 
components are compared against the corresponding context 
as highlighted by the quotation. The degree of specialization 
is defined similarly: 


$$\frac{1}{N} \times \Big( n_{free} \times spec(D_{par}, d_{free}) + \sum_{d_i \in anchored} n_{anch,i} \times spec(q_i, d_i) \Big).$$


Finally, once the degrees of generalization and specialization
are computed for two given messages, A and B, if
$gen(A, B) > \Theta_g$, for a given generalization threshold, $\Theta_g$,
then B is marked as a generalization boundary. When this
is not the case, if $spec(A, B) > \Theta_s$, for a given specialization
threshold, $\Theta_s$, then B is marked as a specialization
boundary within the same topic. If neither of these cases is
true, then B and A are said to be in the same topic segment.
Note that, although we do not elaborate on it in this paper, these
threshold values need to be set through a learning process
which identifies proper thresholds based on a given training
sample. Nevertheless, in Section 4, we experimentally
evaluate the effects of different threshold values on the
performance of the algorithm.


3.2 Segmenting a Hierarchy of Messages


Once we establish the techniques for segmenting a root-to-leaf
chain in a given message hierarchy, extending these
to the segmentation of the entire hierarchy is
straightforward. Since two separate replies to a single message
are created independently of each other, they cannot
be marked to be of the same topic, unless they are independently
identified to be of the same topic as that of their
common parent. Thus, the two-step segmentation process
described above can be repeated in a top-down fashion,
following each chain of the hierarchy independently. Finally,
each connected component of the tree, not split with segment
boundaries, is marked as an atomic segment and indexed
separately, while the specialization and generalization
information is used to identify how keywords are inherited
between ancestors and descendents³. The common ancestor
of all nodes in a given segment is identified as the entry point
of the segment and used in guiding users.


³ The details of the indexing process of the segments are out
of the scope of this paper.
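
The boundary test and the top-down pass over the hierarchy can be sketched as follows. This is only one reading of the description: it assumes the first step has already separated new-topic messages (label N), that every message labeled N, SF, or SG opens a new segment, and that the message tree is given as a child-list dictionary. The names `label_reply` and `segment_tree` are hypothetical.

```python
def label_reply(gen_score, spec_score, theta_g, theta_s):
    """Second-step label for a reply already found to share its parent's topic."""
    if gen_score > theta_g:
        return "SG"  # generalization boundary
    if spec_score > theta_s:
        return "SF"  # specialization boundary
    return "S"       # same topic segment as the parent


def segment_tree(children, root, labels):
    """Group messages into segments by walking each chain top-down.

    children: dict mapping a message id to a list of its reply ids
    labels:   dict mapping each non-root message id to N, S, SF, or SG
    Returns a dict mapping each segment's entry point to its member ids.
    """
    segments = {}
    stack = [(root, root)]  # (message, entry point of its segment)
    while stack:
        node, entry = stack.pop()
        segments.setdefault(entry, []).append(node)
        for child in children.get(node, []):
            # Assumption: any label other than "S" marks a segment boundary.
            child_entry = entry if labels[child] == "S" else child
            stack.append((child, child_entry))
    return segments
```

Because each child is compared only with its own parent, two sibling replies end up in the same segment only when both are labeled S relative to that parent, in line with the independence argument above.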
Table 2: The weighted success rate for the proposed
algorithm is greater than 79%, even for the undifferentiated
scheme, where all errors are counted

                 | Undiff. | Low-only | Diff.
-----------------|---------|----------|--------
Success rate (%) | 79.06%  | 87.31%   | 83.19%


4. EXPERIMENTAL EVALUATION 


In order to evaluate the effectiveness of the segmenta- 
tion techniques presented in this paper, we performed a user 
study and compared the segmentation feedback provided by 
assessors of a discussion board with the segmentation results 
obtained by the proposed algorithm. 

Setup: For the evaluations presented here, due to the
diversity of its postings and message hierarchies, we used the
movie message board available at [2] as the message data
source. We randomly selected

• 20 discussion threads, with
  - a total of 368 messages,
  - an average thread depth of 12.45, and
  - an average quotation depth of 1.3 (86% of the total of
    5241 quotations are from the parent)

from this source and asked 5 users to assess each message and
label it with N for new topic, S for same topic as the parent,
SF for specialization (or focusing), or SG for generalization.
Given the manual labelings from multiple assessors,
we took the majority label to denote the message's relationship
with its parent. We then compared these manual labeling
results with the labels assigned by the proposed automated
segmentation algorithm (which took only 560 ms to segment
the given 20 threads). In this section, we report the results
when the threshold for detecting new segment boundaries is
set to 0.35, the generalization threshold, $\Theta_g$, is set to 0.6, and
the specialization threshold, $\Theta_s$, is set to 0.8 (we discuss
the effects of varying these thresholds later in the section).
Also, for the results presented here, the impact factor for
the parent quotations (d = 1) is imp(d) = 0.5 (we discuss the
effect of different impact factor values later in this section).
Evaluation criteria: In order to observe the effectiveness 
of the proposed algorithms, we computed a labeling success 
rate (or precision), 


$$success\_rate = \frac{\sum_{m \in messages} \big(1 - error\_weight(m)\big)}{\text{number of messages}} \times 100,$$


where error weights are used to account for gravity of the 
error in the computed success rate. We experimented with 
three different schemes as shown in Table 1: 


• Undifferentiated weights: Weights in the first partition of
the table mark all errors with the same (maximum)
error weight, independent of the type of error.

• Low-only weights: Weights in the second partition of the
table only count errors in the first, low-granularity,
step of the algorithm; i.e., only
  - those messages that are erroneously marked as being
    of a new topic, or
  - those that should have been marked as a new
    topic, but were not marked as such
count towards the error rate.

• Differentiated weights: Weights in the third partition of
the table penalize different error types differently. More
specifically, errors within the high-granularity group
(S, SF, and SG) are marked half as costly as errors
across the low-granularity segmentation.


Table 3: Distribution of various types of errors
(all values are percentages of all labeling errors)

Alg. \ User | New-u | Same-u | Spec.-u | Gen.-u | Tot.
------------|-------|--------|---------|--------|------
New-a       | -     | 31.0   | 1.4     | 7.0    | 39.4
Same-a      | 14.1  | -      | 16.9    | 11.3   | 42.3
Spec.-a     | 0.0   | 4.2    | -       | 0.0    | 4.2
Gen.-a      | 7.0   | 5.6    | 1.4     | -      | 14.1
Total       | 21.1  | 40.9   | 19.7    | 18.3   | 100


Table 4: User labelings for the 368 messages in the
randomly selected 20 threads

New | Same | Spec. | Gen. | No Majority (unlabeled) | Tot.
----|------|-------|------|-------------------------|-----
58  | 206  | 39    | 36   | 29                      | 368


Table 2 shows the weighted success rates observed in the 
experiments. 
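
The three weighting schemes and the success-rate formula can be sketched in Python as below. The weights follow the verbal description of the schemes (identical labels cost nothing; errors inside the S/SF/SG group cost 1, 0, or 0.5 depending on the scheme; errors across the new/same divide always cost 1). The function names and the `scheme` strings are hypothetical.

```python
HIGH_GRANULARITY = {"S", "SF", "SG"}


def error_weight(user_label, alg_label, scheme):
    """Penalty for one message under the given weighting scheme."""
    if user_label == alg_label:
        return 0.0
    within_group = user_label in HIGH_GRANULARITY and alg_label in HIGH_GRANULARITY
    if scheme == "undiff":
        return 1.0
    if scheme == "low-only":
        return 0.0 if within_group else 1.0
    if scheme == "diff":
        return 0.5 if within_group else 1.0
    raise ValueError(f"unknown scheme: {scheme}")


def success_rate(label_pairs, scheme):
    """label_pairs: list of (user_label, algorithm_label), one per message."""
    total = sum(1.0 - error_weight(u, a, scheme) for u, a in label_pairs)
    return 100.0 * total / len(label_pairs)
```

For instance, a message labeled S by the users and SF by the algorithm counts as a full error under the undifferentiated scheme, as no error under the low-only scheme, and as half an error under the differentiated scheme.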

Undifferentiated success rate: Based on the user study, 
we observed that the undifferentiated success rate, where all 
errors are penalized with the maximum weight without dis- 
tinguishing between the types of errors, was around 79.06% 
(first column in Table 2). 

Low-only success rate: On the other hand, when we fo- 
cus on only the errors in the first, low-granularity, step of 
the algorithm, we observed that the success rate jumped to 
87.31% (second column in Table 2). 

Differentiated success rate: When a differential penalty 
scheme (where errors within the high-granularity group — S, 
SF, and SG — are marked half as costly as errors across the 
low-granularity segmentation between same and new topics) 
is used, the success rate was 83.19% (last column in Table 2). 
Distribution of the errors: Table 3 provides a detailed 
tally of the types of errors (around 20% of all labelings as 
described above) observed during the user study. In this 
table, the columns correspond to the labelings chosen by 
the users, while the rows correspond to those assigned by 
the proposed algorithm. 

As can be seen by studying the last row of the table, 
which shows the aggregate number of the errors made by 
the proposed algorithm for each user labeling, the greatest 
percentage (40.9%) of labeling errors is due to messages that 
are marked same by the users. In fact, the biggest single 
contributor to the number of errors is the set of same topic 
messages that are labeled as new by the algorithm (31% of 
all errors). In the last column of the table, which shows how 
the errors are distributed among labeling of the algorithm, 
we see that 42% of all errors are due to messages that are 
incorrectly marked same, whereas around 40% of the errors 
are due to those that are incorrectly marked as new. The 
total contribution of specialization and generalization errors 
to the overall rate of the error is less than 20%. 
Table 5: Success rates for individual labelings

Success rate for labelings
New  | Same | Spec. | Gen. | Overall
-----|------|-------|------|--------
0.74 | 0.86 | 0.64  | 0.64 | 0.79


Table 6: The impact of quotations on the labeling
performance

Success rate (%)
Quot. weights | Undiff. | Low-only | Diff.
--------------|---------|----------|--------
Off           | 72.57%  | 85.25%   | 78.90%
On            | 79.06%  | 87.31%   | 83.19%


Note that, since the distribution of labels provided by the
users is not uniform (Table 4), the impact of different types
of errors on the overall success rate varies. Table 5 presents
the success rates achieved by the proposed algorithm for each
label. The success rate achieved for those messages labeled
new by the users is around 74%. The success rate is as high
as 86% for detecting messages that stay within the same
topic. The fine-granularity segmentation success rate in the
second phase is around 64%. As can be seen from the Spec.-u
and Gen.-u columns in Table 3, most of the errors in the
second phase of the algorithm are due to messages that are
marked same topic by the algorithm but further classified
into specialization and generalization categories by the users.
This shows that, while the human assessors can differentiate
fine topic distinctions better, the proposed algorithm may
conservatively classify messages to be of the same topic to
prevent over-segmentation. The overall (undifferentiated)
success rate is close to 80%, as described earlier.
Effect of quotations: In order to observe the impact of 
the quotations on the performance of the segmentation algo- 
rithm, we calculated how the success rates changed when the 
context-sensitive weighting techniques proposed in this pa- 
per were turned off. When the quotations were not treated 
specially, the number of errors in the first step of the al- 
gorithm increased 11%, from 43 to 48 erroneous labelings. 
On the other hand, the total number of errors (including 
both phases of the algorithm) increased 30%, from 71 to 
93, showing that especially the fine-granularity differentia- 
tion required in the second phase benefits significantly from 
the way the proposed algorithm uses quotations for context- 
sensitive weighting (Table 6). 
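
As a consistency check, the error counts quoted above can be related to the reported success rates. The sketch below assumes that the 29 messages without a majority label (Table 4) are excluded from the denominator and that each second-phase error costs 0.5 under the differentiated scheme. Both are assumptions of this sketch, not statements from the evaluation.

```python
TOTAL_MESSAGES = 368
NO_MAJORITY = 29
LABELED = TOTAL_MESSAGES - NO_MAJORITY  # messages with a majority label


def success_rate(total_error_weight, n_messages=LABELED):
    """Success rate (%) given the summed error weight over all messages."""
    return 100.0 * (n_messages - total_error_weight) / n_messages


first_step_errors = 43   # low-granularity (new vs. same) errors, quotations on
all_errors = 71          # errors in both phases, quotations on
second_step_errors = all_errors - first_step_errors

undiff = success_rate(all_errors)
low_only = success_rate(first_step_errors)
diff = success_rate(first_step_errors + 0.5 * second_step_errors)
undiff_quot_off = success_rate(93)  # total errors with quotation weighting off

print(f"{undiff:.3f} {low_only:.3f} {diff:.3f} {undiff_quot_off:.3f}")
```

Under these assumptions the computed values fall within 0.01 percentage points of the 79.06%, 87.31%, and 83.19% reported in Table 2 and of the 72.57% undifferentiated rate reported in Table 6 for the quotations-off case.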

In Table 3, we saw that 31% of all the errors were due
to the set of same topic messages that were labeled as new
by the algorithm. In order to see whether using a different
impact factor formulation would improve this situation, we
tried impact factors with different characteristics. A selection
of the low-granularity (same versus new) labeling errors
is reported in Table 7. The first row of this table corresponds
to the results presented so far. The following rows
show the results obtained when the impact factors were set
such that the resulting keyword vector would have a higher
similarity to the ancestor from which the quotations have
been taken. The results show that, indeed, the number of
same topic errors drops when the impact of the keywords in
the quotations increases. However, this is accompanied by
a significant jump in the number of new messages that are
labeled as same, reducing the overall success rate as shown
in the last column of Table 7. In fact, between the two
extremes (first and last rows) in the table, new message
identification (30 - 15 = 15) is more sensitive to the weight of
quotations than same message identification (28 - 22 = 6).
Thus, overweighting quotations does not help the overall
success rate.


Table 7: The effect of quotation impact factors on
the low-granularity labeling performance

imp(d) for d=1 | same → new | new → same | undiff. succ.
---------------|------------|------------|--------------
0.5            | 28         | 15         | 79.0%
1              | 23         | 18         | 77.6%
1.5            | 24         | 26         | 77.6%
2              | 22         | 30         | 76.4%


Table 8: Effects of different $\Theta_g$ and $\Theta_s$ thresholds

         | 0.0  | 0.25 | 0.5  | 0.75 | 1.0  | Exp.
---------|------|------|------|------|------|-----
Undiff.  | 0.21 | 0.33 | 0.56 | 0.77 | 0.74 | 0.79
Low-only | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87
Diff.    | 0.51 | 0.57 | 0.69 | 0.82 | 0.77 | 0.83

The effect of threshold values: Finally, Table 8 shows
the effect of various $\Theta_g$ and $\Theta_s$ values on the final success
rate. As expected (since it is insensitive to the fine-granularity
segmentation), the low-only success rate is independent
of the values of the $\Theta_g$ and $\Theta_s$ thresholds. Note that
neither too small nor too large values are good for proper
segmentation. As we mentioned earlier, threshold values need
to be set through a machine learning process which identifies
proper values based on a given training sample.


5. CONCLUSIONS 


Message threads evolve with new postings as new mes- 
sages may focus on or diverge from the original theme of the 
thread. In this paper, we presented algorithms for identify- 
ing how the hierarchical content of a discussion board grows 
through generalizations and specializations. This knowledge 
can be used in segmenting the message hierarchy into co- 
herent units to facilitate indexing, retrieval, and ranking, 
as well as in guiding users through segments that are rel- 
evant for their navigational goals. The segmentation algo- 
rithms are being deployed in a software system, called iCare- 
Assistant, which aims at reducing the navigational load for 
blind students in accessing web-based electronic course ma- 
terials through an unobtrusive, task-oriented, and individ- 
ualized delivery interface. However, we note that the tech- 
niques are equally applicable for developing web summariza- 
tion tools for users with sight. 